Adding data from tables

A common situation is that the total dataset is too big to be rebuilt from scratch often, while the number of new records is rather small. Example: a forum with 1,000,000 archived posts, but only 1,000 new posts per day.

In this case, "live" (almost real-time) table updates can be implemented using the so-called "main+delta" scheme.

The idea is to set up two sources and two tables: one "main" table for the data that changes rarely (if ever), and one "delta" table for the new documents. In the example above, the 1,000,000 archived posts would go to the main table, and the 1,000 posts inserted per day would go to the delta table. The delta table can then be rebuilt very frequently, so new documents become searchable within minutes. Deciding which documents go to which table, and rebuilding the main table, can also be fully automated. One option is to create a counter table that stores the document ID at which the data is split, and to update it whenever the main table is rebuilt.

Example: Fully automated live updates

# in MySQL
CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);
# in sphinx.conf
source main
{
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query = SELECT id, title, body FROM documents \
        WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
source delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents \
        WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
table main
{
    source = main
    path = /path/to/main
    # ... all the other settings
}
# note how all other settings are copied from main, but source and path are overridden (they MUST be)
table delta : main
{
    source = delta
    path = /path/to/delta
}
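
The split logic in the configuration above can be tried out without a search server. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for MySQL (an assumption made only so the example is self-contained); the sample rows and the helper names build_main and build_delta are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names follow the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE sph_counter (
        counter_id INTEGER PRIMARY KEY NOT NULL,
        max_doc_id INTEGER NOT NULL
    );
    INSERT INTO documents (id, title, body) VALUES
        (1, 'a', 'first'), (2, 'b', 'second'), (3, 'c', 'third');
""")

MAIN_QUERY = """
    SELECT id FROM documents
    WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
    ORDER BY id
"""
DELTA_QUERY = """
    SELECT id FROM documents
    WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
    ORDER BY id
"""

def build_main(conn):
    # Same statement as the main source's sql_query_pre: record the current split point.
    conn.execute("REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents")
    return [row[0] for row in conn.execute(MAIN_QUERY)]

def build_delta(conn):
    # The delta source only reads the counter; it must not move the split point.
    return [row[0] for row in conn.execute(DELTA_QUERY)]

main_ids = build_main(conn)  # everything present when main was rebuilt
conn.execute(
    "INSERT INTO documents (id, title, body) VALUES (4, 'd', 'new'), (5, 'e', 'newer')"
)
delta_ids = build_delta(conn)  # only rows added since the main rebuild
print(main_ids, delta_ids)
```

Rebuilding the delta repeatedly returns the same growing set of new rows until the main table is rebuilt and the counter moves forward.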

A timestamp column makes a better split variable than the ID, because timestamps can track not just new documents but also modified ones.

For datasets in which documents can be modified or deleted, the delta table should also provide a list of the changed documents, so that their old versions are suppressed and not returned by search queries. This is achieved with a feature called kill-lists. The document IDs to be killed are provided by an auxiliary query defined with sql_query_killlist. The delta table must name the tables to which its kill-list applies using the killlist_target directive. The effect of a kill-list on the target table is permanent: even if a search is made without the delta table, the suppressed documents will not appear in the results.

Note how we're overriding sql_query_pre in the delta source. This override must be explicit. Otherwise, the REPLACE query would also run when building the delta source, effectively nullifying it. However, the first time a directive is issued in an inherited source, it removes all inherited values of that directive, so the encoding setup is lost as well. Therefore sql_query_pre in the delta cannot simply be left empty, and the encoding setup query has to be issued explicitly once again.
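
The override behaviour described above can be modelled in a few lines. This is only an illustrative sketch of the rule as stated in this section (a directive set in the child replaces all inherited values of that directive); the inherit function and the dictionary layout are hypothetical and are not part of any indexer API.

```python
def inherit(parent, child):
    """Resolve a child section against its parent.

    Any directive the child sets replaces the parent's values for that
    directive wholesale; directives the child does not mention are inherited.
    """
    resolved = {name: list(values) for name, values in parent.items()}
    for name, values in child.items():
        resolved[name] = list(values)
    return resolved

main_source = {
    "sql_query_pre": [
        "SET NAMES utf8",
        "REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents",
    ],
    "sql_query": ["SELECT id, title, body FROM documents WHERE id<=(...)"],
}

# Delta without an sql_query_pre override: the REPLACE query is inherited.
delta_without_override = inherit(
    main_source,
    {"sql_query": ["SELECT id, title, body FROM documents WHERE id>(...)"]},
)

# Delta with the override: REPLACE is gone, so SET NAMES must be restated.
delta_with_override = inherit(
    main_source,
    {
        "sql_query_pre": ["SET NAMES utf8"],
        "sql_query": ["SELECT id, title, body FROM documents WHERE id>(...)"],
    },
)

print(delta_without_override["sql_query_pre"])
print(delta_with_override["sql_query_pre"])
```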

Merging tables

Last modified: January 18, 2023

Merging two existing plain tables can be more efficient than indexing the data from scratch, and is desirable in some cases (such as merging the 'main' and 'delta' tables instead of simply rebuilding 'main' in the 'main+delta' partitioning scheme). The indexer therefore has an option to do that. Merging tables is normally faster than rebuilding, but it is still not instant on huge tables. Essentially, it needs to read the contents of both tables once and write the result once. Merging a 100 GB table and a 1 GB table, for example, results in 202 GB of I/O (which is still likely less than indexing from scratch would require).
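
The 202 GB figure follows from reading both tables once and writing the result once. A back-of-the-envelope sketch, assuming the merged output is roughly the sum of the two inputs (the function name merge_io_gb is hypothetical):

```python
def merge_io_gb(dst_gb, src_gb):
    """Estimate total I/O for a merge: read both inputs once, write the result once."""
    read_gb = dst_gb + src_gb
    # Assumption: the merged table is about as large as both inputs combined.
    write_gb = dst_gb + src_gb
    return read_gb + write_gb

print(merge_io_gb(100, 1))  # 202
```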

The basic command syntax is as follows:

sudo -u manticore indexer --merge DSTINDEX SRCINDEX [--rotate] [--drop-src]

Unless --drop-src is specified, only the DSTINDEX table will be affected: the contents of SRCINDEX will be merged into it.

The --rotate switch is required if DSTINDEX is already being served by searchd.

The typical usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX. Thus, when attributes are merged, the values from SRCINDEX win if duplicate document IDs are encountered. Note, however, that the "old" keywords are not automatically removed in such cases. For example, if the keyword "old" is associated with document 123 in DSTINDEX, and the keyword "new" is associated with it in SRCINDEX, document 123 will be found by both keywords after the merge. To mitigate that, you can supply an explicit condition that removes documents from DSTINDEX; the relevant switch is --merge-dst-range:

sudo -u manticore indexer --merge main delta --merge-dst-range deleted 0 0

This switch lets you apply filters to the destination table along with merging. There can be several filters; all of their conditions must be met in order to include the document in the resulting merged table. In the example above, the filter passes only those records where 'deleted' is 0, eliminating all records that were flagged as deleted.
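
The merge semantics described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the indexer's implementation: each table is represented as a dictionary of attributes plus keyword postings, the merge function is hypothetical, and the application is assumed to have set deleted=1 on the old version of document 123 in the main table.

```python
def merge(dst, src, dst_ranges=()):
    """Toy model of merging SRC into DST.

    A table is {"attrs": {doc_id: {name: value}}, "postings": {keyword: set_of_ids}}.
    dst_ranges holds (attribute, min, max) filters, inclusive, applied to DST only;
    a DST document is kept only if it passes every filter.
    """
    kept = {
        doc_id
        for doc_id, attrs in dst["attrs"].items()
        if all(lo <= attrs[name] <= hi for name, lo, hi in dst_ranges)
    }
    attrs = {doc_id: dict(dst["attrs"][doc_id]) for doc_id in kept}
    for doc_id, src_attrs in src["attrs"].items():
        attrs[doc_id] = dict(src_attrs)  # SRC attribute values win on duplicate IDs
    postings = {}
    for keyword, doc_ids in dst["postings"].items():
        postings.setdefault(keyword, set()).update(doc_ids & kept)
    for keyword, doc_ids in src["postings"].items():
        postings.setdefault(keyword, set()).update(doc_ids)
    return {"attrs": attrs, "postings": postings}

main = {
    "attrs": {123: {"deleted": 1}, 7: {"deleted": 0}},
    "postings": {"old": {123}, "seven": {7}},
}
delta = {
    "attrs": {123: {"deleted": 0}},
    "postings": {"new": {123}},
}

plain = merge(main, delta)
filtered = merge(main, delta, dst_ranges=[("deleted", 0, 0)])

print(plain["postings"]["old"])     # document 123 is still found by "old"
print(filtered["postings"]["old"])  # the stale posting is gone
```

With the filter in place, the flagged version of document 123 is dropped from the destination before the merge, so only the keywords from the source remain attached to it.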

--drop-src allows dropping SRCINDEX after the merge and before rotating the tables. This is important if you specify DSTINDEX in the killlist_target of SRCINDEX; otherwise, when the tables are rotated, the documents that have been merged into DSTINDEX may be suppressed by SRCINDEX.

Killlist in plain tables

Last modified: January 18, 2023

With plain tables, keeping the data in the table as fresh as possible is a challenge.

To address it, one or more secondary (also known as delta) tables are used to capture the data modified between the time the main table was created and the current time. Modified data can mean new, updated, or deleted documents. A search then becomes a search over the main table and the delta table. This works without problems when you only add new documents to the delta table, but updated or deleted documents raise the following issue.

If a document is present in both the main and delta tables, searching can return wrong results, because the engine sees two versions of the document and does not know which one to pick. The delta therefore needs a way to tell the search that certain documents in the main table should be ignored. This is where kill-lists come in.

A table can maintain a list of document IDs that is used to suppress records in other tables. This feature is available for plain tables that use database sources or XML sources. In the case of database sources, the source needs to provide an additional query defined by sql_query_killlist. The result is stored in the table as a list of documents that the server can use to remove documents from other plain tables.

This query is expected to return a number of 1-column rows, each containing just the document ID.

In many cases the query is a union between a query that gets a list of updated documents and a list of deleted documents, e.g.:

sql_query_killlist = \
    SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
    SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex
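
To see what such a union returns, here is a minimal sketch that runs an equivalent query against an in-memory SQLite database. SQLite, the sample rows, the integer timestamps, and the kill_list helper are assumptions made for illustration; the @last_reindex variable from the example above is passed as a named parameter instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, updated_ts INTEGER NOT NULL);
    CREATE TABLE documents_deleted (id INTEGER PRIMARY KEY, deleted_ts INTEGER NOT NULL);
    INSERT INTO documents VALUES (1, 100), (2, 250), (3, 300);
    INSERT INTO documents_deleted VALUES (4, 50), (5, 275);
""")

KILLLIST_QUERY = """
    SELECT id FROM documents WHERE updated_ts >= :last_reindex
    UNION
    SELECT id FROM documents_deleted WHERE deleted_ts >= :last_reindex
    ORDER BY id
"""

def kill_list(conn, last_reindex):
    # Returns 1-column rows flattened into a list of document IDs.
    rows = conn.execute(KILLLIST_QUERY, {"last_reindex": last_reindex})
    return [row[0] for row in rows]

print(kill_list(conn, 200))  # [2, 3, 5]
```

Documents 2 and 3 were updated after the last reindex and document 5 was deleted after it, so all three would be suppressed in the target table.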

A plain table can contain a directive called killlist_target, which tells the server that the table provides a list of document IDs to be removed from certain existing tables. The table can use either its own document IDs as the source for this list or provide a separate list.

killlist_target sets the table(s) that the kill-list will be applied to. Optional, default value is empty.

When you use plain tables, you often need to maintain not a single table but a set of them, so that added, updated, and deleted documents are reflected sooner (read about delta table updates). To suppress matches in the previous (main) table that were updated or deleted in the next (delta) table, you need to:

Create a kill-list in the delta table using sql_query_killlist
Specify main table as killlist_target in delta table settings:

table products {
  killlist_target = main:kl
  path = products
  source = src_base
}

When killlist_target is specified, kill-list is applied to all the tables listed in it on searchd startup. If any of the tables from killlist_target are rotated, kill-list is reapplied to these tables. When kill-list is applied, tables that were affected save these changes to disk.

killlist_target has 3 modes of operation:

killlist_target = main:kl. Document ids from the kill-list of the delta table are suppressed in the main table (see sql_query_killlist).
killlist_target = main:id. All document ids from delta table are suppressed in the main table. Kill-list is ignored.
killlist_target = main. Both document ids from delta table and its kill-list are suppressed in the main table.
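
The three modes can be summarised as set operations. The sketch below models only the behaviour listed above; the functions parse_target and suppressed_ids and the sample IDs are hypothetical and do not correspond to any server API.

```python
def parse_target(spec):
    """Split a single target such as 'main:kl' into (table, mode); mode is '' if absent."""
    table, _, mode = spec.strip().partition(":")
    return table, mode

def suppressed_ids(mode, delta_doc_ids, delta_kill_list):
    """Return the IDs suppressed in the target table for a given mode."""
    if mode == "kl":
        return set(delta_kill_list)  # kill-list only
    if mode == "id":
        return set(delta_doc_ids)    # delta's own document IDs only
    if mode == "":
        return set(delta_doc_ids) | set(delta_kill_list)  # both
    raise ValueError(f"unknown killlist_target mode: {mode!r}")

main_ids = {1, 2, 3, 4}
delta_ids = {3, 5}       # document 3 was updated, document 5 is new
delta_kill_list = {2}    # document 2 was deleted

for spec in ("main:kl", "main:id", "main"):
    table, mode = parse_target(spec)
    visible = main_ids - suppressed_ids(mode, delta_ids, delta_kill_list)
    print(spec, sorted(visible))
```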

Multiple targets can be specified, separated by commas:

killlist_target = table_one:kl,table_two:kl

You can change killlist_target settings for a table without rebuilding it by using ALTER.

However, because the 'old' main table has already written the changes to disk, the documents that were deleted in it remain deleted even if it is no longer listed in the killlist_target of the delta table.

ALTER TABLE delta KILLLIST_TARGET='new_main_table:kl'
